OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

ASSESSING THE QUALITY OF DIGITAL RE-PUBLISHING OF TEXTUAL DOCUMENTS THROUGH THE FOLLOW-UP OF A CORRECTION PROTOCOL BY CROWDSOURCING

Internal identifier: 000061 (Main/Exploration); previous: 000060; next: 000062

Authors: Marthe Lagarrigue [France]; Florence Rossant [France]; Alain Pierrot [France]; Joël Gardes [France]; Christophe Maldivi [France]; Eric Petit [France]

Source :

RBID : Hal:hal-01075265

English descriptors

Abstract

The digital re-publishing of documents has become a very important issue. Optical Character Recognition (OCR) is used intensively for this purpose, as it transcribes text images into electronic files, enabling display, indexing, enrichment and dissemination. However, OCR software still fails in many configurations, so the transcription does not reach the required editorial quality (a recognition rate of 99% is required for comfortable reading). In the OZALID project, we propose to rely on crowdsourcing to correct OCR results. A main issue is then to determine when the crowdsourcing has reached its limits. To that end, we present a feasibility study of an original protocol based on indicators that quantify recognition quality in both semantic and semiotic terms. These indicators are computed and tracked throughout the crowdsourcing process until they stabilize. Experimental results show that the proposed observables converge after a number of correction iterations, which makes it possible to stop the crowdsourcing process automatically and to handle very large amounts of data.
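The stopping idea described in the abstract (track a quality indicator over successive correction passes and stop once it is stable) can be illustrated with a minimal sketch. The abstract does not define the indicators or the stability test, so everything below is an assumption made for illustration: the function names `has_converged` and `stopping_iteration`, the sliding window of 3 passes, the tolerance of 0.001, and the accuracy values themselves are made up and are not the authors' protocol or data.

```python
from typing import Optional, Sequence


def has_converged(values: Sequence[float], window: int = 3, tol: float = 0.001) -> bool:
    """Return True when the last `window` values differ by at most `tol`.

    Hypothetical stability test: the source only says the indicators are
    followed "until stability" and does not give a criterion.
    """
    if len(values) < window:
        return False
    recent = values[-window:]
    return max(recent) - min(recent) <= tol


def stopping_iteration(values: Sequence[float], window: int = 3,
                       tol: float = 0.001) -> Optional[int]:
    """Return the 1-based pass at which the indicator first looks stable, else None."""
    for i in range(1, len(values) + 1):
        if has_converged(values[:i], window, tol):
            return i
    return None


# Made-up indicator values, one per correction pass (illustration only).
accuracy_per_pass = [0.950, 0.972, 0.985, 0.9900, 0.9905, 0.9906, 0.9906]

stop = stopping_iteration(accuracy_per_pass)
print(f"Indicator stable after pass {stop}")
```

A sliding window is only one possible choice; a real protocol with several indicators would presumably require all of them to be stable before stopping.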

Url:


Affiliations:


Links toward previous steps (curation, corpus...)


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">ASSESSING THE QUALITY OF DIGITAL RE-PUBLISHING OF TEXTUAL DOCUMENTS THROUGH THE FOLLOW-UP OF A CORRECTION PROTOCOL BY CROWDSOURCING</title>
<author>
<name sortKey="Lagarrigue, Marthe" sort="Lagarrigue, Marthe" uniqKey="Lagarrigue M" first="Marthe" last="Lagarrigue">Marthe Lagarrigue</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-102803" status="INCOMING">
<orgName>ISEP</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-324747" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-324747" type="direct">
<org type="institution" xml:id="struct-324747" status="INCOMING">
<orgName>institut supérieur d'éléctronique de paris</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Rossant, Florence" sort="Rossant, Florence" uniqKey="Rossant F" first="Florence" last="Rossant">Florence Rossant</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-102803" status="INCOMING">
<orgName>ISEP</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-324747" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-324747" type="direct">
<org type="institution" xml:id="struct-324747" status="INCOMING">
<orgName>institut supérieur d'éléctronique de paris</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Pierrot, Alain" sort="Pierrot, Alain" uniqKey="Pierrot A" first="Alain" last="Pierrot">Alain Pierrot</name>
<affiliation wicri:level="1">
<hal:affiliation type="department" xml:id="struct-388118" status="INCOMING">
<orgName>I2S</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-388117" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-388117" type="direct">
<org type="institution" xml:id="struct-388117" status="INCOMING">
<orgName>I2S</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Gardes, Joel" sort="Gardes, Joel" uniqKey="Gardes J" first="Joël" last="Gardes">Joël Gardes</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID">
<orgName>Orange Labs [Grenoble]</orgName>
<desc>
<address>
<addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-366011" type="direct">
<org type="institution" xml:id="struct-366011" status="INCOMING">
<orgName>Orange</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Maldivi, Christophe" sort="Maldivi, Christophe" uniqKey="Maldivi C" first="Christophe" last="Maldivi">Christophe Maldivi</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID">
<orgName>Orange Labs [Grenoble]</orgName>
<desc>
<address>
<addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-366011" type="direct">
<org type="institution" xml:id="struct-366011" status="INCOMING">
<orgName>Orange</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Petit, Eric" sort="Petit, Eric" uniqKey="Petit E" first="Eric" last="Petit">Eric Petit</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID">
<orgName>Orange Labs [Grenoble]</orgName>
<desc>
<address>
<addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-366011" type="direct">
<org type="institution" xml:id="struct-366011" status="INCOMING">
<orgName>Orange</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01075265</idno>
<idno type="halId">hal-01075265</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-01075265</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-01075265</idno>
<date when="2014-11-01">2014-11-01</date>
<idno type="wicri:Area/Hal/Corpus">000016</idno>
<idno type="wicri:Area/Hal/Curation">000016</idno>
<idno type="wicri:Area/Hal/Checkpoint">000021</idno>
<idno type="wicri:Area/Main/Merge">000061</idno>
<idno type="wicri:Area/Main/Curation">000061</idno>
<idno type="wicri:Area/Main/Exploration">000061</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">ASSESSING THE QUALITY OF DIGITAL RE-PUBLISHING OF TEXTUAL DOCUMENTS THROUGH THE FOLLOW-UP OF A CORRECTION PROTOCOL BY CROWDSOURCING</title>
<author>
<name sortKey="Lagarrigue, Marthe" sort="Lagarrigue, Marthe" uniqKey="Lagarrigue M" first="Marthe" last="Lagarrigue">Marthe Lagarrigue</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-102803" status="INCOMING">
<orgName>ISEP</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-324747" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-324747" type="direct">
<org type="institution" xml:id="struct-324747" status="INCOMING">
<orgName>institut supérieur d'éléctronique de paris</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Rossant, Florence" sort="Rossant, Florence" uniqKey="Rossant F" first="Florence" last="Rossant">Florence Rossant</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-102803" status="INCOMING">
<orgName>ISEP</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-324747" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-324747" type="direct">
<org type="institution" xml:id="struct-324747" status="INCOMING">
<orgName>institut supérieur d'éléctronique de paris</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Pierrot, Alain" sort="Pierrot, Alain" uniqKey="Pierrot A" first="Alain" last="Pierrot">Alain Pierrot</name>
<affiliation wicri:level="1">
<hal:affiliation type="department" xml:id="struct-388118" status="INCOMING">
<orgName>I2S</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-388117" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-388117" type="direct">
<org type="institution" xml:id="struct-388117" status="INCOMING">
<orgName>I2S</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Gardes, Joel" sort="Gardes, Joel" uniqKey="Gardes J" first="Joël" last="Gardes">Joël Gardes</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID">
<orgName>Orange Labs [Grenoble]</orgName>
<desc>
<address>
<addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-366011" type="direct">
<org type="institution" xml:id="struct-366011" status="INCOMING">
<orgName>Orange</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Maldivi, Christophe" sort="Maldivi, Christophe" uniqKey="Maldivi C" first="Christophe" last="Maldivi">Christophe Maldivi</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID">
<orgName>Orange Labs [Grenoble]</orgName>
<desc>
<address>
<addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-366011" type="direct">
<org type="institution" xml:id="struct-366011" status="INCOMING">
<orgName>Orange</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Petit, Eric" sort="Petit, Eric" uniqKey="Petit E" first="Eric" last="Petit">Eric Petit</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID">
<orgName>Orange Labs [Grenoble]</orgName>
<desc>
<address>
<addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-366011" type="direct">
<org type="institution" xml:id="struct-366011" status="INCOMING">
<orgName>Orange</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Digital edition</term>
<term>OCR</term>
<term>correction protocol</term>
<term>crowdsourcing</term>
<term>quality assessment</term>
<term>semantics</term>
<term>semiotics</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The digital re-publishing of documents has become a very important issue. Optical Character Recognition (OCR) is used intensively for this purpose, as it transcribes text images into electronic files, enabling display, indexing, enrichment and dissemination. However, OCR software still fails in many configurations, so the transcription does not reach the required editorial quality (a recognition rate of 99% is required for comfortable reading). In the OZALID project, we propose to rely on crowdsourcing to correct OCR results. A main issue is then to determine when the crowdsourcing has reached its limits. To that end, we present a feasibility study of an original protocol based on indicators that quantify recognition quality in both semantic and semiotic terms. These indicators are computed and tracked throughout the crowdsourcing process until they stabilize. Experimental results show that the proposed observables converge after a number of correction iterations, which makes it possible to stop the crowdsourcing process automatically and to handle very large amounts of data.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
</list>
<tree>
<country name="France">
<noRegion>
<name sortKey="Lagarrigue, Marthe" sort="Lagarrigue, Marthe" uniqKey="Lagarrigue M" first="Marthe" last="Lagarrigue">Marthe Lagarrigue</name>
</noRegion>
<name sortKey="Gardes, Joel" sort="Gardes, Joel" uniqKey="Gardes J" first="Joël" last="Gardes">Joël Gardes</name>
<name sortKey="Maldivi, Christophe" sort="Maldivi, Christophe" uniqKey="Maldivi C" first="Christophe" last="Maldivi">Christophe Maldivi</name>
<name sortKey="Petit, Eric" sort="Petit, Eric" uniqKey="Petit E" first="Eric" last="Petit">Eric Petit</name>
<name sortKey="Pierrot, Alain" sort="Pierrot, Alain" uniqKey="Pierrot A" first="Alain" last="Pierrot">Alain Pierrot</name>
<name sortKey="Rossant, Florence" sort="Rossant, Florence" uniqKey="Rossant F" first="Florence" last="Rossant">Florence Rossant</name>
</country>
</tree>
</affiliations>
</record>
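As a rough illustration of how such a record could be read programmatically, the sketch below extracts author names and keywords with Python's standard `xml.etree.ElementTree` module. It works on a hand-trimmed fragment (the `FRAGMENT` string is an assumption, reduced from the record above), because the record as displayed uses the `wicri:` and `hal:` prefixes without declaring them, and a namespace-aware parser would reject it unless those declarations were supplied.

```python
import xml.etree.ElementTree as ET

# Simplified excerpt of the record above (illustration only): elements that
# use the undeclared wicri: and hal: prefixes have been left out.
FRAGMENT = """
<record>
  <author>
    <name first="Marthe" last="Lagarrigue">Marthe Lagarrigue</name>
  </author>
  <author>
    <name first="Florence" last="Rossant">Florence Rossant</name>
  </author>
  <keywords scheme="mix" xml:lang="en">
    <term>OCR</term>
    <term>crowdsourcing</term>
  </keywords>
</record>
"""

root = ET.fromstring(FRAGMENT)

# Collect (last, first) pairs from the attributes of every <name> element.
authors = [(name.get("last"), name.get("first")) for name in root.iter("name")]

# Collect the text of every <term> element under <keywords>.
keywords = [term.text for term in root.iter("term")]

for last, first in authors:
    print(f"{last}, {first}")
print("Keywords:", ", ".join(keywords))
```

The same `iter` calls should apply to the full record once the missing namespace declarations are added to its root element.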

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000061 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000061 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:hal-01075265
   |texte=   ASSESSING THE QUALITY OF DIGITAL RE-PUBLISHING OF TEXTUAL DOCUMENTS THROUGH THE FOLLOW-UP OF A CORRECTION PROTOCOL BY CROWDSOURCING
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024